
fix(fetch): detect encoding for non-UTF-8 pages using charset-normalizer #3880

Open
olegsa wants to merge 6 commits into modelcontextprotocol:main from olegsa:fix/fetch-encoding-detection

Conversation


olegsa commented Apr 9, 2026

Summary

  • Add automatic character encoding detection to mcp-server-fetch using charset-normalizer for pages that don't declare a charset in the HTTP Content-Type header
  • Introduce a get_response_text() helper that checks response.charset_encoding first, then falls back to statistical byte analysis via charset-normalizer
  • Fix garbled text when fetching pages served in non-UTF-8 encodings (e.g. windows-1251, windows-1255, windows-1256, euc-kr) that lack a charset declaration in the HTTP header

Motivation

Many websites (especially non-English ones) serve content in legacy encodings like windows-1255 (Hebrew), windows-1251 (Cyrillic), windows-1256 (Arabic), or euc-kr (Korean) without declaring the charset in the HTTP Content-Type header. The current code uses response.text which defaults to UTF-8, producing garbled/mojibake output for these pages.
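The failure mode is easy to reproduce in plain Python:

```python
# "Привет" ("Hello") encoded in a Cyrillic legacy codepage
raw = "Привет".encode("windows-1251")

# Decoding with the wrong (default) encoding mangles the text into
# replacement characters: the mojibake described above.
bad = raw.decode("utf-8", errors="replace")
good = raw.decode("windows-1251")  # round-trips correctly
```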

Changes

  • server.py: Added get_response_text() that uses charset-normalizer for encoding detection when HTTP headers lack charset info. Replaced response.text with get_response_text(response) in fetch_url(). Also moved httpx from local imports to a top-level import.
  • pyproject.toml: Added charset-normalizer>=3.0.0 as an explicit dependency (already a transitive dep via requests).
  • tests/test_server.py: Added TestGetResponseText class with 5 tests covering UTF-8 passthrough, Ukrainian (windows-1251), Hebrew (windows-1255), Arabic (windows-1256), and Korean (euc-kr) encoding detection.

Test plan

  • All existing tests pass
  • New encoding detection tests pass for Ukrainian, Hebrew, Arabic, Korean
  • UTF-8 pages with charset in HTTP header still use the standard path
  • Non-HTML content (JSON, etc.) is unaffected

Made with Cursor

olegsa added 4 commits April 9, 2026 10:32
…tests

Add charset-normalizer as an explicit dependency in pyproject.toml and
add tests verifying correct decoding of non-UTF-8 pages (Ukrainian
windows-1251, Hebrew windows-1255, Arabic windows-1256, Korean euc-kr).

Made-with: Cursor
Member

@olaservo olaservo left a comment


Nice fix — the get_response_text() helper is well-structured with a conservative fallback chain (charset header → charset-normalizer detection → default). Good test coverage across multiple encodings.

One nit: test_korean_euc_kr actually encodes the text as UTF-8, not euc-kr. Consider renaming it to test_utf8_without_charset_header, or changing it to actually encode as euc-kr to match the test name.

Otherwise LGTM.


This review was assisted by Claude Code.

olegsa added 2 commits April 20, 2026 20:10
The test was encoding the Korean text as UTF-8, so it did not exercise
the charset-normalizer detection path. Encode with euc-kr and extend
the sample text so statistical detection has enough bytes to work with.

Addresses review feedback on modelcontextprotocol#3880.

Made-with: Cursor
Author

olegsa commented Apr 20, 2026

Thanks for the review, @olaservo! Good catch on the euc-kr test — you're right, it was silently encoding as UTF-8. Fixed in 37da471: the test now actually calls text.encode("euc-kr") and uses a longer sample so charset-normalizer's statistical detection has enough bytes to work with. This gives us genuine coverage of the detection path for Korean rather than a duplicate of test_utf8_passthrough. All 5 TestGetResponseText tests still pass locally.

@cliffhall cliffhall added bug Something isn't working server-fetch Reference implementation for the Fetch MCP server - src/fetch labels Apr 20, 2026
@cliffhall
Member

@claude review

Contributor

claude Bot commented Apr 21, 2026

Claude finished @cliffhall's task in 1m 49s


PR Review: fix(fetch): detect encoding for non-UTF-8 pages

Tasks:

  • Gather context from PR description and comments
  • Read changed files
  • Review server.py changes
  • Review test coverage
  • Review dependency changes
  • Post final review

This is a solid, well-scoped fix. The fallback chain in get_response_text() is sensible and the previous reviewer's concern about test_korean_euc_kr has been addressed. A few observations follow.

server.py

get_response_text() (lines 29–38) — looks correct

The three-level fallback chain is appropriate:

  1. Declared charset in HTTP header → use httpx's native decoding (response.text)
  2. No declared charset → run charset_normalizer.from_bytes statistical detection
  3. Detection fails → fall back to response.text (httpx applies its own heuristics)

response.charset_encoding returns None only when there is no charset= parameter in the Content-Type header, which is exactly the condition you want. API usage of from_bytes(...).best() and str(result) is correct.

Inconsistency: robots.txt still uses response.text directly (line 106)

```python
# check_may_autonomously_fetch_url — line 106
robot_txt = response.text
```

This is probably fine in practice (robots.txt is defined to be ASCII), but it's a minor inconsistency with the new approach. A short comment explaining why the standard path is acceptable here would add clarity, or you could use get_response_text(response) for uniformity.

TestFetchUrl mocks are unintentionally compatible with the new code

The existing TestFetchUrl tests use MagicMock() responses. MagicMock().charset_encoding auto-creates a truthy MagicMock object (not None), so get_response_text takes the first branch and calls response.text — which the tests have explicitly set. This works, but it's fragile. If charset_encoding were ever set to None in one of those mocks, get_response_text would then call from_bytes(response.content) on another MagicMock, likely producing surprising results. Not a blocker, but worth noting.
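The fragility is easy to see in isolation: an undeclared attribute on a MagicMock is itself a truthy MagicMock, not None.

```python
from unittest.mock import MagicMock

response = MagicMock()
# Accessing an attribute that was never set auto-creates a child mock...
attr = response.charset_encoding
# ...which is truthy, so an `if response.charset_encoding:` guard passes
# even though no charset was ever configured on the mock.
assert not isinstance(attr, str)
assert bool(attr) is True
```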

tests/test_server.py

Using real httpx.Response objects in TestGetResponseText rather than mocks is exactly the right call — it exercises the actual encoding/decoding path with no stubbing. The _build_response helper is clean.

After the fix in 37da471, test_korean_euc_kr now properly calls .encode("euc-kr") and uses a sufficiently long sample for statistical detection. All five encoding scenarios are covered with realistic, semantically meaningful content.

pyproject.toml

requests is now an unused dependency

requests>=2.32.3 is listed in dependencies, but import requests does not appear anywhere in the server code. The PR description notes charset-normalizer was "already a transitive dep via requests" — now that it's a direct dependency, requests can be removed to reduce the dependency footprint.

Summary

| Area | Status |
| --- | --- |
| Core logic (get_response_text) | ✅ Correct |
| Encoding test coverage | ✅ Good (5 cases, real httpx objects) |
| euc-kr fix from prior review | ✅ Addressed |
| robots.txt consistency | ⚠️ Minor inconsistency |
| requests dependency | ⚠️ Appears unused, could be removed |
| TestFetchUrl mock fragility | 💬 Nit |

The two items worth addressing before merge are removing the unused requests dependency and the robots.txt inconsistency; everything else is a minor nit. Good work on the fix overall.
